{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 机器学习-周志华-习题6.3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.3 选择两个 UCI 数据集，分别用线性核和高斯核训练一个 SVM，并与BP 神经网络和 C4.5 决策树进行实验比较。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里就只用`sklearn`中自带的iris数据集来对比题中几个算法。这里数据集不大，只有150个样本，所以就不拿出额外的样本作为测试集了，进行`5-flod`交叉验证，最后验证集的平均准确率作为评价模型标准。\n",
    "\n",
    "--- \n",
    "- SVM将使用`sklearn.svm`\n",
    "- BP神经网络将使用`Tensorflow`实现\n",
    "- 关于C4.5。Python中貌似没有C4.5的包，在第四章写的决策树代码也并不是严格的C4.5，为了方便这里就还是使用`sklearn`吧。`sklearn`中决策树是优化的CART算法。\n",
    "\n",
    "---\n",
    "此外，各模型都进行了粗略的调参，不过在这里的`notebook`省略了。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1、导入相关包"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from sklearn import datasets\n",
    "from sklearn.model_selection import KFold, train_test_split, cross_val_score, cross_validate\n",
    "from sklearn import svm, tree\n",
    "\n",
    "import tensorflow as tf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2、数据读入"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "iris = datasets.load_iris()\n",
    "X = pd.DataFrame(iris['data'], columns=iris['feature_names'])\n",
    "\n",
    "y = pd.Series(iris['target_names'][iris['target']])\n",
    "# y = pd.get_dummies(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal length (cm)</th>\n",
       "      <th>sepal width (cm)</th>\n",
       "      <th>petal length (cm)</th>\n",
       "      <th>petal width (cm)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)\n",
       "0                5.1               3.5                1.4               0.2\n",
       "1                4.9               3.0                1.4               0.2\n",
       "2                4.7               3.2                1.3               0.2\n",
       "3                4.6               3.1                1.5               0.2\n",
       "4                5.0               3.6                1.4               0.2"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3、模型对比"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.1 线性核SVM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "linear_svm = svm.SVC(C=1, kernel='linear')\n",
    "linear_scores = cross_validate(linear_svm, X, y, cv=5, scoring='accuracy')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.98000000000000009"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "linear_scores['test_score'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.2 高斯核SVM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "rbf_svm = svm.SVC(C=1)\n",
    "rbf_scores = cross_validate(rbf_svm, X, y, cv=5, scoring='accuracy')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.98000000000000009"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rbf_scores['test_score'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.3 BP神经网络"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里BP神经网络使用`tensorflow`实现，其实在`sklearn`中也有（当然在第五章也用`numpy`实现过，也可以用），不过这里因为个人原因还是使用`tensorflow`。。不过事实上如果为了答这道题，使用`sklearn`其实代码量会更少。\n",
    "\n",
    "---\n",
    "`tensorflow`里面没有现成的交叉验证的api（`tensorflow`中虽然也有其他机器学习算法的api，但它主要还是针对深度学习的工具，训练一个深度学习模型常常需要大量的数据，这个时候做交叉验证成本太高，所以深度学习中通常不做交叉验证，这也为什么`tensorflow`没有cv的原因），这里使用 `sklearn.model_selection.KFold`实现BP神经网络的交叉验证。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 定义模型，这里采用一层隐藏层的BP神经网络，神经元个数为16\n",
    "x_input = tf.placeholder('float', shape=[None, 4])\n",
    "y_input = tf.placeholder('float', shape=[None, 3])\n",
    "\n",
    "keep_prob = tf.placeholder('float', name='keep_prob')\n",
    "\n",
    "W1 = tf.get_variable('W1', [4, 16], initializer=tf.contrib.layers.xavier_initializer(seed=0))\n",
    "b1 = tf.get_variable('b1', [16], initializer=tf.contrib.layers.xavier_initializer(seed=0))\n",
    "\n",
    "h1 = tf.nn.relu(tf.matmul(x_input, W1) + b1)\n",
    "h1_dropout = tf.nn.dropout(h1, keep_prob=keep_prob, name='h1_dropout')\n",
    "\n",
    "W2 = tf.get_variable('W2', [16, 3], initializer=tf.contrib.layers.xavier_initializer(seed=0))\n",
    "b2 = tf.get_variable('b2', [3], initializer=tf.contrib.layers.xavier_initializer(seed=0))\n",
    "\n",
    "y_output = tf.matmul(h1_dropout, W2) + b2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 定义训练步骤、准确率等\n",
    "cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=y_output, labels=y_input))\n",
    "\n",
    "train_step = tf.train.AdamOptimizer(0.003).minimize(cost)\n",
    "\n",
    "correct_prediction = tf.equal(tf.argmax(y_output, 1), tf.argmax(y_input, 1))\n",
    "accuracy = tf.reduce_mean(tf.cast(correct_prediction, 'float'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 将目标值one-hot编码\n",
    "y_dummies = pd.get_dummies(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess = tf.Session()\n",
    "init = tf.global_variables_initializer()\n",
    "costs = []\n",
    "accuracys = []\n",
    "\n",
    "for train, test in KFold(5, shuffle=True).split(X):\n",
    "    sess.run(init)\n",
    "    X_train = X.iloc[train, :]\n",
    "    y_train = y_dummies.iloc[train, :]\n",
    "    X_test = X.iloc[test, :]\n",
    "    y_test = y_dummies.iloc[test, :]\n",
    "\n",
    "    for i in range(1000):\n",
    "        sess.run(train_step, feed_dict={x_input: X_train, y_input: y_train, keep_prob: 0.3})\n",
    "\n",
    "    test_cost_, test_accuracy_ = sess.run([cost, accuracy],\n",
    "                                          feed_dict={x_input: X_test, y_input: y_test, keep_prob: 1})\n",
    "    accuracys.append(test_accuracy_)\n",
    "    costs.append(test_cost_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.96666664, 0.96666664, 0.96666664, 0.96666664, 0.93333334]\n",
      "0.96\n"
     ]
    }
   ],
   "source": [
    "print(accuracys)\n",
    "print(np.mean(accuracys))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.4 CART"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "cart_tree = tree.DecisionTreeClassifier()\n",
    "tree_scores = cross_validate(rbf_svm, X, y, cv=5, scoring='accuracy')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'fit_time': array([ 0.00199413,  0.00256157,  0.00185156,  0.00298214,  0.0030067 ]),\n",
       " 'score_time': array([ 0.00099921,  0.00099659,  0.00114751,  0.00106406,  0.        ]),\n",
       " 'test_score': array([ 0.96666667,  1.        ,  0.96666667,  0.96666667,  1.        ]),\n",
       " 'train_score': array([ 0.98333333,  0.98333333,  0.99166667,  0.98333333,  0.975     ])}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tree_scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.98000000000000009"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tree_scores['test_score'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4 总结"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "因为`iris`数据原因，本身容易区分，这四个模型最终结果来看几乎一致（除了自己拿`tensorflow`写的BP神经网络，验证集上的准确率低了0.02）"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
